Transient and Intermittent Fault Recovery without Rollback
Abstract
Increasing chip density combined with heightened reliability expectations has spawned greater interest in fault-tolerant design. In recent years, research into rollback and retry techniques has established them as an effective approach to recovery from transient and intermittent faults. For applications with strict timing requirements, however, the high error latency inherent in retry approaches is unacceptable. We have developed an alternative recovery method with strict error latency boundaries. In addition, the bulky state storage hardware required in rollback designs has been eliminated. The result is a more efficient, more broadly applicable approach to fault-tolerant design.
Similar resources
Encore: Low-Cost, Fine-Grained Transient Fault Recovery
To meet an insatiable consumer demand for greater performance at less power, silicon technology has scaled to unprecedented dimensions. However, the pursuit of faster processors and longer battery life has come at the cost of device reliability. Given the rise of processor (un)reliability as a first-order design constraint, there has been a growing interest in low-cost, non-intrusive techniques...
Orphan-Free Consistent Condition for Log-Based Checkpointing and Rollback Recovery Scheme
The fundamental goal of the log-based fault-tolerant scheme is to bring the system into a consistent global state without any orphan inconsistency. However, the existing Alvisi’s No-Orphans Consistency Condition is sufficient only if the set of local checkpoints of the failed processes always remains consistent. Independent of the specific log-based checkpointing and rollback-recovery ...
Checkpointing and Migration of parallel processes based on Message Passing Interface
This paper presents a Checkpoint-based Rollback Recovery and Migration System for Message Passing Interface, ChaRM4MPI, for Linux Clusters. Some important fault tolerant mechanisms are designed and implemented in this system, which include coordinated checkpointing protocol, synchronized rollback recovery, process migration, and so on. Owing to ChaRM4MPI, the node transient faults can be recove...
Fault-tolerant sub-lithographic design with rollback recovery.
Shrinking feature sizes and energy levels coupled with high clock rates and decreasing node capacitance lead us into a regime where transient errors in logic cannot be ignored. Consequently, several recent studies have focused on feed-forward spatial redundancy techniques to combat these high transient fault rates. To complement these studies, we analyze fine-grained rollback techniques and sho...
The FTMPS-Project: Design and Implementation of Fault-Tolerance Techniques for Massively Parallel Systems
The FTMPS-project provides a solution to the need for fault-tolerance in large systems. A complete fault-tolerance approach is developed and being implemented. The built-in hardware error-detection features combined with software error-detection techniques provide a high coverage of transient as well as permanent failures. Combined with the diagnosis software, the necessary information for the...
Publication date: 1998